Abstract

Successful  military  operations  depend  on  the  aerobic  fitness  of  military  personnel.  Training 
programs  that  tax  the  cardiorespiratory  system  are  known  to  increase  aerobic  fitness,  and  program 
design  choices  influence  the  magnitude  of  these  gains.  This  review  attempted  to  identify  design 
choices  that  could  be  considered  best  practices.  A  best  practice  is  a  design  option  (such  as  training 
at  an  intensity  of  90%  of  one’s  maximum  heart  rate)  that  produces  significantly  better  results  than 
any  other  option  (e.g.,  training  at  60%).  To  this  end,  this  review  employed  meta-analytic 
techniques  to  synthesize  studies  that  investigated  the  design  options  that  determine  aerobic  fitness. 
To  ensure  sensitive  assessments  of  program  design  effects,  statistical  procedures  adjusted  for  the 
repeated  measures  structure  of  the  study  designs.  Unfit  individuals  benefitted  much  more  from 
training  than  fit  individuals.  Gender  and  age  were  not  influential  moderators.  Regarding  program 
design  options,  the  intensity  of  a  training  program,  the  duration  of  a  training  session,  the  frequency 
of  training  per  week,  and  the  length  of  a  training  program  were  all  significant  moderators. 
However,  with  the  exception  of  training  intensity,  post  hoc  comparisons  generally  showed  that  no 
single  design  option  was  significantly  better  than  all  others.  The  available  evidence  may  rule  out 
some  design  choices,  but  it  is  too  limited  to  identify  best  practices. 
Summary

Successful  military  operations  depend  on  the  aerobic  fitness  of  military  personnel.  Training 
programs  that  tax  the  cardiorespiratory  system  are  known  to  increase  aerobic  fitness,  and  program 
design  choices  influence  the  magnitude  of  these  gains.  This  review  attempted  to  identify  design 
choices  that  could  be  considered  best  practices. 

Issue 

Because  successful  military  operations  depend  on  the  aerobic  fitness  of  military  personnel,  effort 
must  be  devoted  to  the  design  of  optimal  training  programs.  Given  the  many  design  options 
available,  to  what  extent  does  the  current  literature  on  aerobic  training  prioritize  some  options  over 
others? 

Objective 

The purpose of this meta-analysis is to integrate results across several studies to determine the
effects of many factors on aerobic fitness, including those that relate to the training program (e.g.,
frequency, intensity, duration, and mode of exercise) and those that relate to the program
participants (e.g., initial fitness level).

Approach 

Statistics  describing  the  effects  of  training  on  aerobic  fitness  were  extracted  from  journal  articles. 
Every study included in this review employed a pretest-posttest design. To determine the effect
size  due  to  training,  estimates  of  the  training  response  were  adjusted  for  the  type  of  research 
design.  Meta-regression  models  evaluated  potential  moderator  variables  (a  demographic  variable 
or  program  element  that  might  account  for  variation  in  the  overall  effect  size  due  to  training).  Best 
practices  were  evaluated  through  post  hoc  comparisons  between  different  levels  of  each  moderator 
variable. 

Results 

Unfit  individuals  benefitted  more  from  training  than  fit  individuals.  Gender  and  age  were  not 
influential  moderators.  The  intensity  of  a  training  program,  the  duration  of  a  training  session,  the 
frequency  of  training  per  week,  and  the  length  of  a  training  program  were  all  significant 
moderators.  However,  with  the  exception  of  training  intensity,  statistical  tests  generally  showed 
that  no  single  design  option  was  better  than  all  others.  The  available  evidence  may  rule  out  some 
design  choices,  but  it  is  too  limited  to  identify  best  practices. 
Reviews of the aerobic training literature have shown that training increases aerobic fitness
(Londoree,  1997;  Samitz  &  Bachl,  1991;  Wenger  &  Bell,  1986).  The  same  reviews  have 
established  that  aerobic  fitness  is  influenced  by  training  program  design.  With  the  effectiveness  of 
aerobic  training  well  established,  attention  shifts  to  the  question  of  whether  it  is  possible  to  identify 
best  practices  for  aerobic  training.  A  best  practice  is  a  specific  program  design  option  that  is 
superior  to  all  other  possible  choices  for  that  program  design  facet.  For  example,  the  intensity  of 
each  training  session  (measured,  for  instance,  in  terms  of  the  percentage  of  an  individual’s 
maximum  heart  rate)  is  a  program  design  facet.  If  the  cumulative  research  record  indicated  that  an 
intensity  equal  to  90%  of  one’s  maximum  heart  rate  produced  significantly  better  results  than  any 
other  choice  for  this  facet,  then  90%  of  the  maximum  heart  rate  would  be  a  best  practice. 

This review attempted to identify best practices based on the available evidence, focusing on
maximal oxygen uptake (i.e., VO2max) as the key dependent variable. This measure is typically
considered the gold standard for indexing cardiorespiratory fitness. Previous reviews have
estimated the effects of several program design facets, such as training intensity or session
duration, on aerobic fitness. Those meta-analyses can be viewed as studying best practices, but
only indirectly: they did not formally test whether the differences between design options
identify a best practice. This meta-analysis examines the contribution of these factors within the
context of formally identifying best practices.

The  second  difference  between  this  review  and  prior  reviews  involved  the  treatment  of  statistical 
issues.  One  set  of  issues  derived  from  the  repeated  measures  structure  of  the  evidence.  Aerobic 
fitness  studies  routinely  employ  repeated  measures  research  designs.  Tests  of  aerobic  capacity  are 
administered  before  the  training  program  begins,  and  again  after  the  program  has  been  completed. 
The  difference  between  the  pre  and  post  training  scores  is  the  basis  for  estimating  the  effect  size 
(ES)  for  the  training  program.  Steps  must  be  taken  to  adjust  for  repeated  measures  experimental 
designs  when  estimating  a  study’s  ES  (Morris  &  DeShon,  2002). 

Another  statistical  issue  derived  directly  from  the  current  interest  in  identifying  best  practices.  It  is 
not  enough  to  demonstrate  that  program  design  choices  affect  the  size  of  the  training  response. 
Analysis  of  variance  (ANOVA)  tests  have  been  used  to  test  the  hypothesis  that  the  effects  of 
different  design  choices  are  the  same.  Rejecting  this  null  hypothesis  has  only  indicated  that  some 
options  differ  from  other  options.  It  is  not  enough  to  know  that  differences  between  options  exist. 
The  existence  of  differences  does  not  guarantee  the  existence  of  a  best  choice.  Thus,  a  significant 
ANOVA  must  be  followed  by  analyses  that  evaluate  differences  between  specific  program  design 
options.  This  review  employed  post  hoc  comparisons  to  determine  whether  the  design  option  that 
produced  the  largest  ES  was  truly  a  best  practice. 

This  review  attempts  to  identify  best  practices  for  several  design  facets  of  aerobic  training 
programs.  Statistical  methods  are  introduced  to  analyze  the  repeated  measures  structure  of  the  data 
and  the  need  for  post  hoc  comparisons  to  determine  whether  a  significant  moderator  effect  truly 
identifies  a  best  practice.  As  a  result,  this  review  provides  a  different  perspective  on  the  available 
evidence.  In  particular,  this  review  attempts  to  formally  identify  best  practices  for  the  program 
design  facets  of  training  intensity,  session  duration,  the  number  of  training  sessions  per  week,  the 
length  of  the  training  program,  the  type  of  exercise  (e.g.,  cycling  or  running),  and  the  type  of 
training  program  (interval  or  continuous).  The  search  for  best  practices  also  considered  initial 
fitness level as a key demographic variable that might influence the impact of different program
design  choices. 


Methods 


Literature  Search  Procedures 

The  initial  search  of  the  literature  centered  on  the  PubMed  database.  The  search  terms  included 
various  combinations  of  the  following:  “training,”  “aerobic  training,”  “aerobic  fitness,” 
“cardiorespiratory  fitness,”  “cardiorespiratory  training,”  “cardiovascular  training,”  “maximal 
oxygen consumption,” “VO2max,” and “functional capacity.” The search produced a list of 6,099
candidate  articles.  This  list  was  narrowed  down  by  excluding  studies  with  participant  samples 
consisting  of  animals,  patients,  children,  adolescents,  and  those  who  were  obese  or  who  were 
diabetic  (the  exclusionary  criteria  are  discussed  in  greater  detail  below).  These  criteria  generated  a 
reduced  list  of  3,814  candidate  articles,  which  was  further  reduced  to  756  by  requiring  that  only 
articles  with  experimental  trials  be  included.  The  PubMed  abstracts  for  the  remaining  756  articles 
were  reviewed  to  determine  whether  they  met  the  inclusion  criteria  for  this  review.  Articles  were 
dropped  at  this  point  in  the  search  only  if  the  information  in  the  abstract  clearly  indicated  that  the 
study  failed  to  meet  at  least  one  of  the  criteria.  In  addition  to  the  previously  stated  exclusionary 
criteria, subsequent screening required the studies to have some measure of VO2max, a specific
aerobic training program, and no specialized respiratory treatment. These criteria reduced
the list to 251 articles. In addition to the PubMed search, 150 articles that contributed data to
previous  aerobic  fitness  reviews  were  also  examined  (Londoree,  1997;  Samitz  &  Bachl,  1991; 
Wenger  &  Bell,  1986). 

The full texts of 401 articles that passed the initial screening process (251 from the PubMed
search;  150  from  previous  reviews)  were  examined  to  determine  whether  the  studies  met  the 
following  criteria: 

1. Study participants were required to be healthy. This criterion led to the exclusion of
studies  whose  participants  were  hypertensive  or  diabetic  (among  other  conditions).  As  a 
general  rule,  a  study  was  excluded  if  the  study  participants  were  described  as 
“patients.”  However,  studies  of  “overweight”  individuals  were  accepted,  so  weight 
considerations  eliminated  only  studies  of  individuals  toward  the  upper  end  of  the  excess 
weight  range.  The  objective  in  making  these  exclusions  was  to  eliminate  studies  that 
might  produce  atypical  effects  because  of  limitations  on  the  ability  to  perform  training 
exercises,  and/or  that  involved  disease  and  metabolic  processes  that  might  modify  the 
training  response. 

2.  Study  participants  could  be  no  younger  than  16  nor  older  than  50  years  of  age.  This 
criterion  was  intended  to  restrict  the  study  samples  to  a  population  more  similar  to 
typical  military  personnel,  in  addition  to  minimizing  the  confounding  of  training  effects 
with  the  effects  of  normal  developmental  processes. 

3. Maximal oxygen uptake (VO2max) was expressed in milliliters of oxygen per kilogram
per minute (i.e., ml·kg⁻¹·min⁻¹).
4. The study reported pre and post training measures of VO2max and the standard deviations
for those measurements. This information was the minimum required to compute ES
when combined with assumptions about the magnitude of the pretest-posttest
correlation (see Appendix A). VO2max was measured more than twice in some studies.
When this was the case, ES was computed using the initial and final measurements.
Computing the effect for each phase of the training programs would have increased the
complexity of the repeated measures problem. Therefore, ES always represented the
final cumulative training impact.

5.  The  training  program  was  endurance-based  rather  than  resistance-based. 

6.  Study  participants  were  given  no  medications  (e.g.,  beta  blockers,  such  as  propranolol). 
However,  for  those  studies  that  evaluated  the  effects  of  different  medications  on  aerobic 
capacity,  placebo  groups,  when  they  were  reported,  were  included  in  the  analysis. 

The  final  database  consisted  of  data  from  181  studies  that  met  the  inclusion  criteria.  Control 
groups  from  those  studies  were  excluded  from  the  review,  as  they  were  independent  groups  that 
participated  in  no  training  program.  With  this  restriction,  294  samples  provided  sufficient  data  to 
be  included  in  this  review.  The  cumulative  sample  size  was  3,382  study  participants. 

Table 1

Sample Characteristics

                      k      ΣN     Mean      SD    Minimum   Maximum
Age                  290    3342    26.75    7.49     15.6      50.5
Height               197    2211   173.80    7.02    154.1     186.9
Weight               230    2588    72.63   11.62     48.9     162.0
Percent body fat      81    1026    20.65    6.20      8.2      36.0

Note.  The  statistics  describe  the  population  of  study  samples  rather  than  a  population  of  individuals. 
The  data  were  not  weighted  for  the  computations  that  generated  these  descriptive  statistics. 


Demographic  and  Methodological  Variables 

Age,  height,  weight,  percent  body  fat.  Age,  height,  weight,  and  percent  body  fat  were  coded 
from  descriptive  statistics  reported  in  the  studies,  and  a  summary  of  these  data  is  provided  in  Table 
1.  Note  that  these  statistics  are  based  on  sample  means,  and  not  on  data  from  individuals. 

Gender. For most studies, the samples were composed entirely of men or entirely of women.
Some studies consisted of samples of both sexes, and a few provided no definite information
regarding gender. To represent this variability, gender was coded as men, women, or men and
women combined.

Age.  For  most  studies,  the  average  age  of  study  participants  was  reported  separately  for  each 
independent  treatment  group  in  the  study.  For  other  studies  involving  multiple  independent  groups, 
only  the  overall  mean  age  was  provided.  When  this  was  the  case,  the  separate  groups  were 
assigned  the  overall  means.  An  age  range  (e.g.,  18  to  22  years)  was  another  common  reporting 
method.  Finally,  some  studies  did  not  report  age  directly,  but  provided  age-related  demographic 
information (e.g., university students). Qualitative data were coded based on judgments of the age
range  that  would  be  typical  of  the  group  described.  Age  was  classified  into  four  categories: 
younger  than  20,  20-29,  30-39,  and  40  and  older. 

Initial Fitness Level. The initial fitness levels of study participants were inferred from their overall
initial VO2max values. The initial coding employed seven categories: very poor, poor, fair, average,
good, very good, and excellent. A study sample was classified into an initial fitness level category
if the mean initial VO2max value for the sample participants fell within a particular range (usually of
3 to 4 VO2max units). The coding scheme incorporated gender and age differences. Men, in general,
were associated with higher VO2max values, and age was inversely related to aerobic capacity. The
specific ranges of VO2max values that served as the basis for coding are provided in Appendix C.

Program  Design  Facets 

Program  length.  Program  length  was  the  number  of  weeks  that  the  training  program  lasted. 

Intensity.  The  studies  varied  in  how  training  intensity  was  characterized.  In  most  cases,  intensity 
was defined in terms of the percentage of maximum heart rate, VO2max percentage, or percentage of
heart rate reserve. When a range of percentages was provided (e.g., 85%-95% of maximum heart
rate),  intensity  was  recorded  by  taking  the  midpoint  of  the  range.  When  multiple  ranges  were 
provided,  the  average  of  the  separate  midpoints  was  recorded.  Given  the  variation  in  how  intensity 
was  reported  across  studies,  a  common  classification  scheme  was  adopted,  based  on  Heyward 
(2006).  The  scheme  enables  classification  of  distinct  physiological  measurements  into  3 
categories:  (1)  moderate,  (2)  hard,  and  (3)  very  hard  to  maximal.  The  full  classification  scheme  is 
provided  in  Appendix  C. 
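
The midpoint rule described above can be sketched in a few lines of Python. This is only an illustration of the stated coding rule; the function names and the example ranges are hypothetical, and the sketch does not reproduce the Appendix C category boundaries.

```python
def range_midpoint(low, high):
    """Midpoint of one reported intensity range, in percent."""
    return (low + high) / 2.0


def coded_intensity(ranges):
    """Code a program's intensity from its reported range(s).

    `ranges` is a list of (low, high) percentage pairs. A single reported
    value can be passed as (x, x). When several ranges are reported, the
    coded value is the average of the separate midpoints.
    """
    midpoints = [range_midpoint(low, high) for low, high in ranges]
    return sum(midpoints) / len(midpoints)


# Hypothetical examples, not taken from any study in the review.
single_range = coded_intensity([(85, 95)])
two_ranges = coded_intensity([(60, 70), (80, 90)])
```

The coded percentage would then be mapped to the moderate, hard, or very hard to maximal category using the scheme in Appendix C.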

Duration  of  a  training  session.  This  variable  refers  to  the  duration  of  a  single  training  session, 
measured  in  minutes.  This  variable  was  coded  into  five  categories,  based  on  15  minute  intervals: 
less  than  15  minutes,  16-30,  31-45,  46-60,  and  61  and  greater. 

Frequency.  This  variable  refers  to  the  number  of  times  that  study  participants  trained  during  a 
given week. The number of sessions per week was described in terms of five categories: 1-2 sessions
per week, 3, 4, 5, or 6 and greater.

Type  of  exercise.  This  variable  refers  to  the  type  of  exercise  adopted  during  the  training  program. 
The  type  of  exercise  typically  was  either  cycling  (most  often  on  a  cycle  ergometer),  or  running 
(including  jogging  or  walking,  either  on  a  track  or  treadmill).  In  some  cases,  a  training  program 
incorporated  both  cycling  and  running,  which  was  classified  as  a  distinct  category.  Some  studies 
included  other  kinds  of  exercises,  such  as  tennis  or  cross-country  skiing,  but  they  occurred 
individually  so  infrequently  that  they  were  classified  together  as  “other.” 

Type  of  training  program.  This  variable  refers  to  whether  the  training  program  involved 
intermittent (i.e., work periods separated by rest periods) or continuous training.
Table 2 presents the distribution of the demographic variables and program training characteristics.
For  each  variable,  the  table  provides  the  corresponding  number  of  samples  and  effect  sizes,  and 
number  of  participants  summed  across  samples. 

Table 2

Distribution of Demographic Variables and Program Training Characteristics

                                  No. of samples   No. of ESs      ΣN

Gender
  Men                                   86             181        2025
  Women                                 32              68         760
  Men and Women                         22              39         541

Age group
  <20                                   19              29         307
  20-29                                117             181        1958
  30-39                                 29              52         576
  ≥40                                   22              28         501

Initial fitness
  Very poor                              4               5          60
  Poor                                  14              15         167
  Fair                                  55              73         984
  Average                               79             109        1280
  Good                                  34              49         482
  Very good                             19              25         249
  Excellent                             14              18         160

Intensity
  Moderate                              13              14         126
  Hard                                 133             195        2309
  Very hard                             30              42         482
  Maximal                                5               8          59

Duration (minutes per session)
  <15                                   10              13         180
  16-30                                 61             104        1221
  31-45                                 65              82         966
  46-60                                 22              35         364
  ≥61                                   14              19         256

Frequency (sessions per week)
  1-2                                   15              20         253
  3                                     80             132        1629
  4                                     39              58         620
  5                                     30              47         524
  ≥6                                    19              25         225

Program length (in weeks)
  1-4                                   19              26         243
  5-6                                   20              27         287
  7-8                                   38              65         701
  9-10                                  38              63         728
  11-13                                 29              51         577
  ≥14                                   37              60         827

Activity
  Cycling                               67              92         943
  Run/walk                              88             152        1800
  Both                                  10              12         117
  Other                                 31              37         508
Analysis Procedures

Every study included in this review employed a pretest-posttest design. For this reason, methods
described by Morris and DeShon (2002) were applied to compute appropriate ESs for repeated
measures (simple ESRM; see Appendix A). Meta-regression models evaluated potential moderator
variables. A moderator was a demographic variable or a program element that might account for
variation in ESRM. The meta-regression analyses applied Hedges and Olkin’s (1985) general
methods. These methods included weighted ANOVA and weighted linear regression. The weight
variable was the inverse of the estimated variance for ESRM. In certain cases, not all of the
moderator variable levels contained comparable data, so separate analyses were performed.
When comparable data were available, analyses with combinations of independent
variables were conducted.
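
To make the repeated measures effect size and the inverse-variance weighting concrete, the following Python sketch computes a standardized mean change in the change-score metric and a weighted average. It is a minimal sketch under stated assumptions, not the computation documented in Appendix A: the variance formula is a common large-sample approximation chosen for illustration, and the pretest-posttest correlation r and all numeric inputs are hypothetical.

```python
import math


def es_rm(mean_pre, mean_post, sd_pre, sd_post, r):
    """Standardized mean change: (post - pre) / SD of the change scores.

    The SD of the change scores is recovered from the pretest and posttest
    SDs and an assumed pretest-posttest correlation r.
    """
    sd_diff = math.sqrt(sd_pre ** 2 + sd_post ** 2 - 2.0 * r * sd_pre * sd_post)
    return (mean_post - mean_pre) / sd_diff


def approx_variance(d, n):
    """Large-sample approximation to the sampling variance of d.

    Assumption for illustration only; the review's exact variance estimate
    is described in its Appendix A.
    """
    return 1.0 / n + d ** 2 / (2.0 * n)


def weighted_mean(effects, variances):
    """Inverse-variance weighted mean of a set of effect sizes."""
    weights = [1.0 / v for v in variances]
    return sum(w * d for w, d in zip(weights, effects)) / sum(weights)


# Hypothetical sample: VO2max rises from 40 to 45 ml/kg/min, SD = 5 at both
# time points, n = 10, assumed r = 0.5.
d = es_rm(40.0, 45.0, 5.0, 5.0, 0.5)
v = approx_variance(d, 10)
```

Because each weight is the inverse of the variance, large samples with precisely estimated effects contribute more to the weighted mean than small samples do.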

Moderators were evaluated in two steps. The first step was an overall test for a moderator effect to
determine whether the ESRM differed significantly across the levels of the moderator variable. The
second step was taken only if there was a statistically significant moderator effect. Post hoc
comparisons were conducted to determine which groups differed significantly. The average ESRM
values for the moderator groups were ranked from largest to smallest. The group with the largest
average ESRM was adopted as the reference group. The first post hoc test compared the reference
group to the group with the second-largest average ESRM. If these two groups differed
significantly, the post hoc comparisons stopped at this point. If the two groups did not differ
significantly, the group with the third-largest average was compared to the reference group. The
comparisons continued down the rank-ordered moderator groups until a significant difference
was found. The comparisons stopped at that point, and all remaining groups were classified as
differing significantly from the reference group.

Some post hoc comparison procedures required multiple significance tests. Performing multiple
significance tests increases the probability that at least one comparison will be statistically
significant by chance alone. A Bonferroni significance criterion was adopted to fix the
analysis-wide probability of error at 5% or less. The post hoc procedures involved j - 1
comparisons for a moderator with j levels. The Bonferroni criterion for each moderator was
p critical = .05/(j - 1).

The  post  hoc  comparisons  identified  equivalence  sets.  These  sets  consisted  of  the  design  option 
with  the  largest  average  effect,  plus  the  alternative  options  that  produced  effects  that  were  not 
significantly  different  from  this  reference  value.  The  sets  were  equivalent  in  the  sense  that  the 
alternative  options  in  the  set  could  not  be  confidently  classified  as  less  effective  than  the  optimum 
design  option  based  on  the  available  evidence. 
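
The two-step moderator evaluation and the resulting equivalence set can be sketched in Python as follows. This is an illustrative sketch, not the review's SPSS or R code. It assumes that each moderator level is summarized by an inverse-variance weighted mean, that the overall test is a between-level heterogeneity statistic, and that each post hoc comparison is a 1-df chi-square test on the difference between two weighted means. The function names and the data are hypothetical.

```python
import math


def pooled(effects, variances):
    """Inverse-variance weighted mean of one level, and that mean's variance."""
    weights = [1.0 / v for v in variances]
    mean = sum(w * d for w, d in zip(weights, effects)) / sum(weights)
    return mean, 1.0 / sum(weights)


def q_between(groups):
    """Step 1: between-level statistic, referred to chi-square with (levels - 1) df."""
    stats = [pooled(d, v) for d, v in groups.values()]
    total_weight = sum(1.0 / var for _, var in stats)
    grand_mean = sum(mean / var for mean, var in stats) / total_weight
    q = sum((mean - grand_mean) ** 2 / var for mean, var in stats)
    return q, len(groups) - 1


def equivalence_set(groups, alpha=0.05):
    """Step 2: rank levels, then compare each to the top level until one differs.

    Requires at least two levels. Returns the reference level plus every level
    that could not be distinguished from it at the Bonferroni criterion.
    """
    stats = {name: pooled(d, v) for name, (d, v) in groups.items()}
    ranked = sorted(stats, key=lambda name: stats[name][0], reverse=True)
    reference = ranked[0]
    p_critical = alpha / (len(ranked) - 1)
    members = [reference]
    for name in ranked[1:]:
        diff = stats[reference][0] - stats[name][0]
        chi_square = diff ** 2 / (stats[reference][1] + stats[name][1])
        # Upper-tail probability of a chi-square variate with 1 df.
        p_value = math.erfc(math.sqrt(chi_square / 2.0))
        if p_value < p_critical:
            break  # this level and all lower-ranked levels differ from the reference
        members.append(name)
    return members


# Hypothetical data: level -> (effect sizes, sampling variances).
example = {
    "moderate": ([1.0, 1.2], [0.1, 0.1]),
    "hard": ([2.0, 2.2], [0.1, 0.1]),
    "very hard": ([3.0, 3.4], [0.1, 0.1]),
}
q, df = q_between(example)
best = equivalence_set(example)
```

An equivalence set with a single member corresponds to a best practice in the sense used in this review; a set with several members means that the evidence cannot separate the top-ranked option from the alternatives in the set.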

Large  samples  can  produce  significant  results  even  for  trivial  differences  (Rosenthal  &  Rosnow, 
1984).  To  avoid  mistaking  sample  size  for  explanatory  power,  the  Tucker-Lewis  index  (TLI; 
Tucker  &  Lewis,  1973)  was  adapted  to  provide  an  ES  index  for  the  moderator  analyses.  This  index 
is the proportion of the greater-than-chance variation in ESRM accounted for by a moderator or set
of  moderators  (see  Appendix  B).  Cohen’s  (1988)  ES  criteria  were  applied  to  characterize  the  TLI 
as  indicating  trivial,  small,  moderate,  or  large  moderator  effects. 
Funnel plots were constructed to evaluate the potential effects of publication bias (Light &
Pillemer,  1984).  The  file-drawer  problem  was  not  examined  because  both  the  typical  ES  and  the 
total  number  of  studies  were  large.  Under  those  circumstances,  Rosenthal’s  (1979)  file  drawer 
criterion  would  almost  certainly  be  satisfied. 

Analyses  were  carried  out  with  the  SPSS-PC,  Version  17,  computer  program  (SPSS,  Inc.,  Chicago, 
IL)  and  R,  package  version  1.5.2.  (R  Development  Core  Team,  Vienna,  Austria). 

Results 

Program  Length  Effect 

ESRM generally increased with program length. Preliminary analyses of the association between
program length and ESRM compared linear, quadratic, logarithmic, power, and growth models as
mathematical representations of this relationship. The logarithmic model given as Equation 1
provided the best prediction¹ of ESRM (t is program length, in weeks):

ESRM = 1.10 + .42 × ln(t)    (1)

The graph of this equation is given in Figure 1. The correlation of ESRM with program length was
small (r = .18), but statistically significant (χ² = 27.63, 1 df, p < .001). It is important to note that
the linear form of the model could be misleading. The intercept (1.10) might be mistakenly
interpreted as indicating that ESRM is predicted to be greater than 0 prior to training (at t = 0). If
this equation expressed a simple linear regression of ESRM on weeks of training, then this would
be the usual interpretation of the equation intercept. This interpretation is misleading because the
equation takes as input the natural logarithm of time (number of weeks). Because ln(1) = 0, the
intercept is the predicted ESRM after 1 week of training, not at the start of training. If we solve
the equation for an ESRM of 0, the estimated time to produce an effect of this size is .52 days.
This estimate would likely correspond to, at most, one training session in a typical program.
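
As a worked check of Equation 1, the short Python sketch below evaluates the model and solves it for an effect of 0. It uses the rounded coefficients printed in the equation, so the solution comes out at roughly half a day, slightly below the .52 days reported above (which was presumably computed from unrounded coefficients). The function names are illustrative.

```python
import math

INTERCEPT = 1.10  # rounded intercept from Equation 1
SLOPE = 0.42      # rounded coefficient on ln(t) from Equation 1


def predicted_es(weeks):
    """Predicted ESRM after `weeks` of training (weeks must be > 0)."""
    return INTERCEPT + SLOPE * math.log(weeks)


def weeks_for_es(target):
    """Program length, in weeks, at which the model predicts `target`."""
    return math.exp((target - INTERCEPT) / SLOPE)


es_after_one_week = predicted_es(1.0)    # equals the intercept, since ln(1) = 0
days_to_zero = weeks_for_es(0.0) * 7.0   # convert weeks to days
```

The model is undefined at t = 0, which is why the intercept cannot be read as the effect before training begins.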


1  As  measured  by  the  Akaike  Information  Criterion  (AIC).  The  AIC  is  a  widely  used  tool  for  model  selection  that 
incorporates  the  fit  of  a  model  to  the  data  (specifically,  the  probability  of  the  data  given  the  model),  and  the 
complexity  of  the  model,  which  is  measured  by  the  number  of  parameters  in  the  model  and  is  incorporated  to  penalize 
overfitting. 
Figure 1. Gain in aerobic training response, ESRM, as a function of program length.

Initial  Moderator  Analyses 

The  overall  mean  ESRM  due  to  aerobic  training  was  2.03;  however,  the  analysis  revealed 
significant  heterogeneity  among  the  effect  sizes,  so  this  single  value  provides  a  poor  summary 
description  of  the  data.  A  more  accurate  description  will  rely  on  an  investigation  of  potential 
moderating  variables  that  can  help  explain  the  variation  between  studies.  To  this  end,  moderator 
analyses  were  conducted  treating  ESRM  as  the  dependent  variable. 

A summary of the initial moderator analyses is provided in Table 3, which includes the χ² and TLI
values for each of the moderator variables. Table 4 (p. 16) summarizes the program design
moderator effects in terms of best practices. Table 4 provides the average effect sizes for different
options for each program design facet, and indicates the equivalence sets based on those averages.
The program design facets of intensity, frequency, and duration were significant moderators, as
expected from previous reviews, whereas program type (intermittent or continuous) was not.
Exercise type was a statistically significant moderator, but the TLI value indicated that the
differences were trivial, so this moderator was excluded from subsequent analyses. Of the
participant characteristics, gender group and initial fitness level were statistically significant, with
the latter being a particularly strong moderator of ESRM. However, the influence of gender group
was driven almost entirely by study samples consisting of both men and women; samples composed
of either just men or just women did not differ significantly. In what follows, the most relevant
moderator variables from a program design perspective are discussed in greater detail. In addition,
more focused moderator analyses take into account the pervasive influence of initial fitness level
as a key demographic moderator variable.


Aerobic  Training  and  Best  Practices  1 1 


Table 3

Summary of Overall Moderator Analyses

Moderator                   χ²     df     Sig      TLI
Age                        4.50     3    0.213    <.001
Gender group              49.95     2   <0.001     .065
  Men vs. women            2.82     1    0.093    <.001
Initial fitness level    104.84     6   <0.001     .128
Intensity                 44.65     2   <0.001     .066
Frequency                 45.84     4   <0.001     .050
Duration                  40.04     4   <0.001     .045
Program length            67.80     6   <0.001     .080
Exercise type              7.96     3    0.047    <.001
Type of program            0.68     2    0.713    <.001

Intensity. The intensity of a training program was a statistically significant moderator (χ² = 44.65,
2 df, p < .0001, TLI = .07). As intensity increased, ESRM increased monotonically, and (almost)
linearly. Training intensities that varied from very hard to maximal produced the largest overall
gain (as shown in Figure 2), a qualitative pattern that is consistent with previous reviews. Post hoc
comparisons revealed that the effect size for the most intense training differed significantly from
that for hard intensity (χ² = 22.00, 1 df, p < .0001), implicating very hard to maximal intensity as
a best practice.
Figure 2. Gain in aerobic training response (as indicated by the ESRM depicted on the y-axis)
as  a  function  of  the  intensity  of  the  training  program.  Each  level  of  the  independent  variable  is 
labeled  by  the  number  of  effect  sizes  ( k )  associated  with  that  level. 


Frequency. The number of sessions per week was a statistically significant moderator (χ² = 45.84,
4 df, p < .0001, TLI = .05). This preliminary analysis included all data points, but the average
ESRM for 1-2 days per week exceeded the gains of any other number of days per week. This
counterintuitive result is illustrated in Figure 3, and is driven by two very large ESRM values (> 4).
A boxplot examination of these data suggests that they are mild outliers (that is, just outside the
interquartile range), so a second analysis was conducted with these two ESRM values removed. With
these data points removed, frequency was still a statistically significant moderator (χ² = 32.26,
4 df, p < .0001, TLI = .03), with the largest ESRM associated with 4 sessions per week. The
qualitative pattern of results, shown in Figure 3, is consistent with previous reviews (i.e., Wenger
& Bell, 1986). Post hoc comparisons of the trimmed data revealed that 4 sessions per week differed
significantly from 3 sessions per week, but not from 4, 5, or 1-2 sessions per week. Care should be
taken, however, in evaluating the relative benefits of training 1 to 2 times per week versus 4 or 5
times, as only 18 ESRM values were included in the 1-2 sessions group, which may be too few to
reach any strong statistical conclusion.




[Figure 3 plot. x-axis: training frequency (number of sessions per week), with levels 1-2 (k = 18),
3 (k = 132), 4 (k = 58), 5 (k = 47), and >6 (k = 25). Error bars are 95% confidence intervals.]

Figure  3.  Gain  in  aerobic  training  response  (as  indicated  by  the  ESRM  depicted  on  the  y-axis) 
as  a  function  of  frequency  of  training  (in  sessions  per  week). 

Duration. The duration of a training session was a statistically significant moderator (χ² = 40.04, 4
df, p < .001, TLI = .05). Training durations exceeding an hour produced the largest training
response ESrm, but this value differed significantly only from training durations of less than 15
minutes (χ² = 27.38, 1 df, p < .001). The relationship between aerobic training gains and session
duration is illustrated in Figure 4.
Figure 4. Gain in aerobic training response as a function of duration. Each level of the
independent variable is labeled by the number of effect sizes (k) associated with that level.


Initial fitness level. The initial fitness level of study participants was a key moderator variable (χ²
= 104.84, 6 df, p < .0001, TLI = .13). As shown in Figure 5, though gains in aerobic capacity were
achieved across all levels of initial fitness, the largest gains were obtained by study participants of
relatively modest initial fitness. The relationship between initial fitness level and aerobic fitness
gains is not unexpected; indeed, the same relationship has been found for resistance training
(Vickers, Hervig, & Barnard, unpublished report).

ESrm generally increased with program length for both unfit (χ² = 53.49, 6 df, p < .0001, TLI =
.07) and fit individuals (χ² = 15.29, 6 df, p = .02, TLI = .11). The logarithmic models for unfit and
fit individuals are given below as Equations 2 and 3, respectively:

$$ES_{RM} = 1.91 + .22 \ln(t) \tag{2}$$

$$ES_{RM} = .70 + .49 \ln(t) \tag{3}$$
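To see what the two logarithmic models imply, the sketch below evaluates them at a few program lengths. The coefficients are those reported for unfit and fit individuals; treating t as program length in weeks is an assumption, since the excerpt does not state the unit, and the names `MODELS` and `predicted_es` are illustrative.

```python
import math

# (intercept, slope on ln(t)) as reported for the unfit and fit models.
MODELS = {"unfit": (1.91, 0.22), "fit": (0.70, 0.49)}

def predicted_es(group: str, t: float) -> float:
    """Predicted ESrm for a group at program length t (assumed to be in weeks)."""
    intercept, slope = MODELS[group]
    return intercept + slope * math.log(t)

for t in (4, 8, 12):
    unfit = predicted_es("unfit", t)
    fit = predicted_es("fit", t)
    print(f"t = {t:2d}: unfit = {unfit:.2f}, fit = {fit:.2f}")

# Program length at which the two fitted curves would meet.
crossover = math.exp((1.91 - 0.70) / (0.49 - 0.22))
print(f"curves cross near t = {crossover:.0f}")
```

Under these coefficients the predicted gain for unfit individuals stays above that for fit individuals at every t below the crossover point, although the gap narrows as programs lengthen because the fit model has the steeper slope.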

Moderator  Analyses  Adjusted  for  Initial  Fitness  Level 

The  bivariate  moderator  analyses  revealed  expected  effects  of  key  training  program  facets.  But  the 
strong  influence  of  initial  fitness  on  ESRM  suggests  that  a  more  focused  moderator  analysis  would 
hold  initial  fitness  level  constant.  To  this  end,  participants  were  divided  into  two  groups,  and 
analyses were carried out for each group separately. The first group (N = 202) included all
participants classified as being of very poor to average fitness (“unfit”), and the second (N = 92)
included  all  those  classified  as  being  of  good  or  better  fitness  (“fit”)  using  the  classification 
standards  found  in  Appendix  C.  To  preview,  with  the  exception  of  program  length,  significant 
moderator  effects  held  only  for  unfit  individuals;  accordingly,  a  detailed  summary  of  the  results  for 
unfit  individuals  is  provided  in  Table  5  (p.  19). 

Intensity. Program intensity was a statistically significant moderator for unfit individuals (χ² =
47.91, 2 df, p < .0001, TLI = .08), but not for fit individuals (χ² = 2.74, 2 df, p = .25). As in the
overall analysis, very hard to maximal training intensities produced the largest overall gain. Post
hoc comparisons revealed that the effect size for the greatest intensity differed significantly from
hard intensity (χ² = 23.16, 1 df, p < .0001), implicating very hard to maximal intensity as a best
practice for unfit individuals.


Figure 5. Gain in aerobic training response, ESrm, as a function of initial fitness level.


Frequency. The number of sessions per week was a statistically significant moderator for unfit
individuals (χ² = 34.39, 4 df, p < .0001, TLI = .10), but not for fit individuals (χ² = 5.09, 4 df, p =
.28). In contrast to the analysis of the overall results, the largest gains occurred for 6 sessions per
week, but post hoc comparisons revealed that this value differed significantly only from 3 sessions
per week (χ² = 7.76, 1 df, p < .01).

Duration. The duration of a training session was a statistically significant moderator for unfit
individuals (χ² = 61.12, 4 df, p < .001, TLI = .10), but not for fit individuals (χ² = 2.93, 4 df, p =
.57). As in the overall analysis, training durations exceeding an hour produced the largest training
response ESrm. This value differed significantly from the second largest ESrm, associated with 15
to 30 minutes (χ² = 15.10, 1 df, p < .001), implicating durations longer than an hour as a best
practice. However, care should be taken in the interpretation of this result. Only 6 ESrm were
included in durations longer than an hour, and 2 of those 6 data points were exceptionally large
(greater than 4) and outside the interquartile range, suggesting that those data points are likely
outliers. Removing those data points reduces the ESrm from 3.10 to 1.49, highlighting their
influence on the overall analysis. With those influential data points removed, duration was still a
statistically significant moderator for unfit individuals (χ² = 46.73, 4 df, p < .001, TLI = .08), but
with session durations between 15 and 60 minutes showing the largest gains, illustrated in Figure
6. Post hoc comparisons revealed that the average ESrm for 15 to 30 minutes differed significantly
from durations exceeding an hour (χ² = 6.27, 1 df, p = .01), suggesting that 15 to 60 minute
durations are an equivalence class.

Table 4

Design Facet Moderator Effects

Moderator                Level                  ESrm    k(a)   Equivalence set(b)

Intensity                Moderate               1.09    14
                         Hard                   1.98    195
                         Very hard to maximal   2.43    42
  χ² = 44.65, 2 df, p < .0001, TLI = .07

Frequency                1-2                    1.95    18     {4, 5, 1-2}
(number of               3                      1.85    132
sessions per week)       4                      2.33    58
                         5                      2.20    47
                         >6                     1.76    25
  χ² = 32.26, 4 df, p < .0001, TLI = .04

Duration                 < 15                   1.21    13     {>61, 16-30, 46-60, 31-45}
(minutes per             15-30                  2.17    104
session)                 30-45                  2.01    82
                         45-60                  2.02    35
                         >60                    2.33    11
  χ² = 40.04, 4 df, p < .0001, TLI = .05

(a) "k" is the number of samples that provided averages for analysis. (b) The equivalence sets
include all design options that were not significantly different from the option with the highest
ESrm. The design options are listed from largest to smallest ESrm in the set.

Figure 6. Gain in aerobic training response, ESrm, for unfit individuals as a function of
session duration (after removing influential data points from durations exceeding an
hour). Each level of the independent variable is labeled by the number of effect sizes (k)
associated with that level.


Discussion 

Changes  in  aerobic  capacity  depend  on  several  factors,  some  related  to  the  characteristics  of 
individuals,  others  to  the  characteristics  of  the  training  programs.  Within  the  latter  set,  this  review 
confirms  previous  research  that  has  demonstrated  the  importance  of  intensity,  frequency,  duration, 
and  program  length  as  factors  that  contribute  to  aerobic  fitness.  However,  with  the  exception  of 
training  intensity,  this  review  has  not  identified  best  practices.  Program  design  facets  were 
statistically  significant  moderators  of  ESrm,  but  post  hoc  analyses  did  not  single  out  one  option  as 
significantly  better  than  all  others.  Failing  to  identify  best  practices  is  not  unique  to  this  review. 
Program  design  facets  often  are  statistically  significant  moderators  of  the  training  response,  but 
post  hoc  analyses  fail  to  identify  any  single  option  as  significantly  better  than  all  other  options. 
Given this general trend, the current findings cannot be dismissed as resulting from the
inclusion criteria or analysis procedures that have been employed in the current review. Nor do
the current findings indicate that there is something unique about aerobic training that
precludes the identification of best practices: a recent meta-analysis on resistance training
(Vickers, et al., unpublished report) also failed to identify best practices using the same criterion.
The statistical methods adopted in this review should have sharpened the contrasts between design
options.  Specifically,  repeated  measures  analyses  are  expected  to  produce  larger  effect  sizes,  since 
repeated  measures  designs  give  rise  to  smaller  sampling  variability.  The  smaller  sampling 
variability  amplifies  differences  between  the  average  ESrm  values  for  different  design  options  in 
post  hoc  analyses,  so  the  current  procedure  should  have  increased  the  likelihood  of  finding  a  best 
practice. 

The  failure  to  identify  best  practices  does  not  mean  that  such  practices  do  not  exist.  Every  analysis 
produced  one  option  that  had  a  larger  ESrm  than  all  other  options  for  that  facet.  The  problem  was 
that  the  differences  between  the  most  promising  option  and  other  choices  were  not  large  enough  to 
be statistically significant. Although the comparisons have not been reported in detail here, many
post hoc comparisons produced very small χ² values despite moderately large sample sizes. The
implication  is  that  the  available  evidence  would  have  to  be  multiplied  many  times  to  make  the 
contrasts  between  the  design  options  statistically  significant.  If  the  required  data  were  available, 
the  conclusion  still  might  be  that  the  differences  were  too  small  to  be  important.  It  is  debatable 
whether  the  extensive  additional  research  needed  to  clearly  define  best  practices  would  really  have 
much  impact  on  program  design  choices. 

A  low  probability  of  identifying  best  practices  at  any  time  in  the  near  future  does  not  mean  that 
aerobic  training  research  fails  to  offer  any  advice  on  training  program  design.  Aerobic  training 
research  helps  to  single  out  some  design  options  as  less  effective  than  others.  While  the  typical 
equivalence  set  included  more  than  one  option,  it  is  also  true  that  it  seldom  contained  all  possible 
options.  Given  the  available  evidence,  trends  in  the  data  that  are  corroborated  across  different 
reviews  may  suggest  sound  practical  guidelines,  subject  to  constraints  that  a  program  coordinator 
might  face  (e.g.,  the  cost,  in  terms  of  dollars  or  time,  to  implement  one  facet  instead  of  another). 

As a guideline for future studies, it may be more productive to conduct research to rule out some
options, focusing on what is reasonable, given what we do know, rather than on what is best, absent
what is almost impossible to know.

Finally,  comparing  the  results  of  this  meta-analysis  to  a  recent  review  of  the  resistance  training 
literature  (Vickers,  et  al.,  unpublished  report)  could  potentially  yield  general  training  principles 
that  can  help  to  inform  reasonable  expectations  for  any  physical  training  program.  For  example,  in 
this  meta-analysis  and  in  Vickers,  et  al.,  unfit  individuals  showed  significantly  greater 
improvement  than  fit  individuals.  Also,  the  rate  of  improvement  for  both  aerobic  and  resistance 
training  followed  a  similar  growth  pattern,  one  best  described  mathematically  as  a  logarithmic 
function. 

Specifically, after comparing the growth patterns associated with aerobic and resistance training,
some striking similarities emerge. First, the direction and strength of the relationship between
program length and ESrm are nearly identical across training types (r = .18 and r = .21 for aerobic
and resistance training, respectively). Second, the best statistical model relating program length
and ESrm shares an identical structural form and similar set of estimated parameters across
training types. In particular, these equations are:

$$ES_{RM} = 1.10 + .42 \ln(t) \qquad \text{(Aerobic training; 4)}$$

$$ES_{RM} = 0.41 + .55 \ln(t) \qquad \text{(Resistance training; 5)}$$
Table 5

Design Facet Moderator Effects for Unfit Individuals

Moderator                Level                  ESrm           k        Equivalence set

Intensity                Moderate               1.13           11
                         Hard                   2.11           142
                         Very hard to maximal   2.63           35
  χ² = 47.91, 2 df, p < .0001, TLI = .08

Frequency                1-2                    2.22           12       {>6, 4, 5, 1-2}
(sessions per week)      3                      1.94           97
                         4                      2.53           43
                         5                      2.24           38
                         >6                     2.60           9
  χ² = 34.39, 4 df, p < .0001, TLI = .05

Duration                 < 15                   1.13           10       {16-30, 31-45, 46-60}
(in minutes)             16-30                  2.26           88
                         31-45                  2.24           58
                         46-60                  2.21           20
                         >61                    3.10/1.49(a)   6/4(a)
  χ² = 61.12, 4 df, p < .0001, TLI = .10
  χ² = 46.73, 4 df, p < .0001, TLI = .08(b)

Program length           1-2                    1.07           1        {9-10, 3-4, >14}
(in weeks)               3-4                    2.61           6
                         5-6                    1.97           21
                         7-8                    1.99           45
                         9-10                   2.68           45
                         11-13                  1.96           41
                         >14                    2.41           37
  χ² = 53.49, 6 df, p < .0001, TLI = .07

(a) Latter values are the ESrm and sample size for the group after removing influential data points.
(b) Summary statistics for the analysis after removing influential data points from the “>61” group.

The equations share similar rates of change in ESrm, implying that aerobic and resistance gains
accrue  at  roughly  the  same  rate.  The  equations  differ  primarily  with  respect  to  the  intercept,  an 
additive  constant.  To  determine  what  this  difference  implies,  note  again  that  the  intercept  here  does 
not  have  the  usual  interpretation  found  in  linear  regression  (i.e.,  the  value  of  the  dependent  variable 
when  the  independent  variable  is  set  to  zero).  Since  the  predictor  is  transformed  logarithmically, 
one  way  to  interpret  the  intercept  is  to  solve  the  equation  for  ESrm  =  0.  The  solution  is  an  estimate 
of  the  number  of  sessions  expected  to  produce  no  training  effect.  For  resistance  training,  the 
number  of  sessions  producing  no  effect  corresponds  to  1  or  2  in  a  typical  program;  for  aerobic 
training,  the  number  of  sessions  corresponds  to  one  at  most.  In  other  words,  aerobic  training  could 
be  expected  to  return  measurable  gains  a  bit  earlier  in  a  typical  program  than  strength  training,  but 
over  time  the  overall  rate  of  return  on  training  would  differ  only  slightly  between  the  two  program 
types. 
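The intercept interpretation above amounts to solving a + b ln(t) = 0, which gives t = exp(-a/b). The sketch below works this out for the aerobic (Equation 4) and resistance (Equation 5) coefficients. Two assumptions are made that the excerpt does not confirm: that t is program length in weeks, and that a typical program runs 3 sessions per week (the hypothetical `SESSIONS_PER_WEEK`), which is used only to express the result in sessions.

```python
import math

def zero_gain_time(intercept: float, slope: float) -> float:
    """Solve intercept + slope * ln(t) = 0 for t."""
    return math.exp(-intercept / slope)

aerobic_t0 = zero_gain_time(1.10, 0.42)     # coefficients of Equation 4
resistance_t0 = zero_gain_time(0.41, 0.55)  # coefficients of Equation 5

# Hypothetical conversion from program length to a session count.
SESSIONS_PER_WEEK = 3

for label, t0 in (("aerobic", aerobic_t0), ("resistance", resistance_t0)):
    sessions = t0 * SESSIONS_PER_WEEK
    print(f"{label}: t0 = {t0:.2f}, about {sessions:.1f} sessions")
```

With these assumptions the zero-gain point falls below one session for aerobic training and between one and two sessions for resistance training, which matches the ordering described in the text.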

Practical  Recommendations? 

The  primary  aim  of  this  review  has  been  to  determine  whether  best  practices  exist  for  different 
program  design  facets.  It  is  important  to  note  that  the  criterion  for  what  counts  as  a  best  practice 
was  a  stringent  one,  and  that  in  many  cases  there  may  simply  have  been  too  little  power  to  detect  a 
potentially  important  difference.  While  the  review  could  end  here,  treating  the  null  results  as  a  call 
for  further  research,  it  is  important  not  to  lose  sight  of  what  reviews  such  as  these  are  intended  to 
achieve;  namely,  informed  suggestions  for  guidelines  given  the  available  evidence. 

But  with  no  clear  evidence  that  best  practices  exist  for  most  design  facets,  what  procedures  are 
available  to  translate  effect  size  estimates  into  practical  guidelines?  One  way  is  to  translate  the 
results  into  a  more  intuitively  meaningful  measure  that  could  provide  a  secure  basis  (if  not  ideal) 
for  program  design  choices.  This  review  has  focused  on  average  individual  change  for  a  given 
study  sample,  and  employed  ESrm  as  the  measure  of  improvement  in  aerobic  fitness.  While  this 
measure  is  appropriate  for  meta-analyses  that  focus  on  individual  change  (and  employ  a  pretest- 
posttest  design),  the  results  can  sometimes  be  difficult  to  interpret. 

For example, how much better is a training facet that yields an ESrm of 2.46 compared to one that
yields 2.25? It may be the case that the difference is statistically significant, which would
implicate the former facet as a better practice than the latter, but how much should we read into
such a difference?2

One  way  to  answer  this  question  is  to  translate  effect  sizes  into  an  estimated  percentage  of  the 
trained  population  that  would  be  expected  to  improve  their  cardiorespiratory  fitness.  For  example, 
an  ESrm  of  .65  implies  that  the  change  would  be  positive  for  74%  of  program  participants  (Morris 
& DeShon, 2002).3 For 2.46 and 2.25, the estimates are 99.3% and 98.8%, suggesting that we
should  not  read  too  much  into  the  difference  between  the  effect  sizes.  The  main  drawback  to  this 


2  A  similar  problem  has  been  addressed  earlier  in  the  review,  which  motivated  the  use  of  the  TLI  to  estimate  the 
importance  of  a  statistically  significant  finding  (see  Appendix  B  for  details).  However,  the  goal  of  this  section  is  to 
connect  the  results  of  the  meta-analysis  to  practical  guidelines,  and  the  TLI  does  not  admit  a  natural  interpretation  that 
would  address  this  problem. 

3  Assuming  normally  distributed  data.  See  Morris  and  DeShon  (2002)  for  a  more  detailed  discussion  of  this 
assumption. 
approach is that it only estimates the percentage of participants expected to improve, but not by
how  much,  relative  to  those  who  had  not  trained  at  all.  Since  some  improvement  would  be 
expected  in  response  to  any  training,  this  strategy  may  be  uninformative.  Indeed,  for  nearly  all 
training  facets  presented  in  Tables  4  and  5,  estimated  percentages  ranged  from  96%-99%;  the  only 
exceptions  were  moderate  intensity  training  (86%  expected  to  improve),  and  training  for  less  than 
15  minutes  per  session  (89%  expected  to  improve). 
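The percentages quoted above can be reproduced under the normality assumption noted in footnote 3. Because ESrm is the mean change divided by the standard deviation of change, the share of participants with a positive change is the standard normal cumulative probability at ESrm. The sketch below assumes this is the calculation behind the reported figures; the function name `percent_improving` is illustrative.

```python
from statistics import NormalDist

def percent_improving(es_rm: float) -> float:
    """Percentage of participants expected to show a positive change,
    assuming normally distributed change scores."""
    return 100.0 * NormalDist().cdf(es_rm)

# ESrm values quoted in the text and in Table 4.
examples = {
    "illustrative ESrm of .65": 0.65,
    "ESrm of 2.25": 2.25,
    "ESrm of 2.46": 2.46,
    "moderate intensity (Table 4)": 1.09,
    "less than 15 minutes (Table 4)": 1.21,
}
for label, es in examples.items():
    print(f"{label}: {percent_improving(es):.1f}%")
```

Because the normal curve flattens quickly above two standard deviations, large differences in ESrm translate into very small differences in the percentage expected to improve.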

The main limitation of the previous method is that it relies on an effect size that is
based on the variability of change scores, which focuses on individual change and not on the
relationship between a trained group and an untrained group. Thus, another way to answer the
question  is  to  shift  the  research  focus  to  an  analysis  based  on  score  variability  within  the  separate 
groups.  The  latter  analysis  enables  us  to  make  statements  about  the  average  performance  of  one 
group  relative  to  the  other.  Assuming  that  the  populations  are  normally  distributed  with  equal 
variance,  we  can  then  translate  effect  size  estimates  into  percentile  rankings  (e.g.,  the  average 
performance  after  training  at  a  hard  intensity  was  greater  than  81%  of  the  no-training  population). 

To shift from a focus on individual change scores to a comparison between groups, the ESrm was
converted to the effect size for independent groups (ESig) (Morris & DeShon, 2002). For the ESrm
of 2.46 and 2.25, the corresponding ESig are 1.10 and 1.01 (the conversion formula, Equation
A3, and its meaning are provided in Appendix A). An ESig of 1.10 means that the average
performance of the trained group was better than 86% of the untrained population; an ESig of 1.01
means that the average performance of the trained group was better than 84% of the untrained
population. As these numbers suggest, perhaps we should not read too much into the difference
between the groups. The estimated percentages for the main program design facets are provided in
Table 6.
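The conversion and percentile steps can be sketched as follows. The pretest-posttest correlation is not stated in this section; the value of .90 used below (`RHO`) is an assumption inferred from the reported pairs (2.46 to 1.10 and 2.25 to 1.01), and the function names are illustrative.

```python
import math
from statistics import NormalDist

RHO = 0.90  # assumed pretest-posttest correlation, inferred from the reported pairs

def es_ig(es_rm: float, rho: float = RHO) -> float:
    """Convert a repeated-measures effect size to an independent-groups one."""
    return es_rm * math.sqrt(2.0 * (1.0 - rho))

def percentile_in_untrained(es_rm: float, rho: float = RHO) -> float:
    """Percentile of the untrained population exceeded by the average trained
    participant, assuming normal populations with equal variance."""
    return 100.0 * NormalDist().cdf(es_ig(es_rm, rho))

# ESrm values from the text and from the intensity rows of Table 4.
for label, es in (("2.46", 2.46), ("2.25", 2.25),
                  ("moderate", 1.09), ("hard", 1.98), ("very hard", 2.43)):
    print(f"ESrm {label}: ESig = {es_ig(es):.2f}, "
          f"percentile = {percentile_in_untrained(es):.0f}%")
```

With this assumed correlation the sketch reproduces the 86% and 84% figures quoted above, as well as the 69%, 81%, and 86% reported for the three intensity levels in the all-individuals column of Table 6.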

These  percentages  are  intended  to  complement  the  more  stringent  statistical  definition  of  a  “best 
practice.”  However,  despite  improvement  in  the  interpretability  of  the  results,  open  questions 
remain.  While  it  may  be  apparent  that  it  is  worthwhile  to  design  a  training  program  with  at  least  a 
hard  intensity  level  (given  the  14%  relative  increase  over  moderate  intensity  training),  it  is  less 
clear  in  other  cases.  Would  the  choice  for  a  very  hard  to  maximal  training  intensity  be  warranted, 
given the 5% relative increase, but also greater risk for injury? This and related questions are
beyond the scope of this review; they entail understanding the physical fitness demands expected
for particular occupations, and the larger economic and environmental contexts in which training
will  occur.  Ideally,  these  percentages  would  be  judged  relative  to  occupational  physical  fitness 
standards.  Future  research  should  be  directed  towards  understanding  how  to  relate  standardized 
measures  of  aerobic  gains  to  the  physical  demands  of  specific  occupations. 
Table 6

Design Facet Moderator Effects for All and Unfit Individuals

Moderator                         Level                  All individuals(a)   Unfit individuals(b)

Intensity                         Moderate               69%                  69%
                                  Hard                   81%                  83%
                                  Very hard to maximal   86%                  88%

Frequency                         1-2                    81%                  84%
(number of sessions per week)     3                      80%                  81%
                                  4                      85%                  87%
                                  5                      84%                  84%
                                  >6                     78%                  88%

Duration                          < 15                   71%                  69%
(minutes per session)             16-30                  83%                  84%
                                  31-45                  82%                  84%
                                  46-60                  82%                  84%
                                  >61                    85%                  75%

Program length                    1-2                    67%                  68%
(in weeks)                        3-4                    76%                  88%
                                  5-6                    80%                  81%
                                  7-8                    80%                  81%
                                  9-10                   86%                  88%
                                  11-13                  80%                  81%
                                  >14                    85%                  86%

(a) Percentile ranking that the average trained individual would have in the untrained
population (data are from all participants; no distinction in fitness made). (b) Percentile
ranking that the average trained individual would have in the untrained population (data are
only from unfit individuals).

As  a  further  guide  toward  practical  recommendations,  it  should  be  noted  that  the  conclusions  of 
this  analysis  are  generally  consistent  with  the  recommendations  of  the  American  College  of  Sports 
Medicine  (ACSM).  For  healthy  adults  under  the  age  of  65,  the  ACSM  recommends  moderate  to 
intense  training,  20  to  30  minutes  a  day,  3  to  5  days  a  week.  In  addition,  the  ACSM  emphasizes  the 
point  that  physical  activity  exceeding  the  basic  recommendations  provides  even  greater  health 
benefits (a finding corroborated by this meta-analysis). Comparing the ACSM’s recommendations
to the results of this meta-analysis highlights broad commonalities, with increasing levels of
intensity yielding larger results, the greatest gains from weekly frequency occurring at 4 sessions
per week, and session durations between 20 and 30 minutes returning the largest absolute gain.4 But
an  important  advantage  of  this  meta-analysis  over  the  ACSM  and  previous  meta-analyses  is  that  it 
provides  quantitative  estimates  of  the  relative  expected  gains  from  several  design  facets.  While 
design  options  could  not  be  distinguished  in  most  cases  on  the  basis  of  best  practices,  these 
estimates  will  play  an  important  role  in  developing  statistical  models  for  predicting  expected 
aerobic  gain,  given  a  set  of  chosen  design  features. 


4  There  are  essentially  no  differences  in  fitness  gains  among  the  duration  groups  of  15-30  minutes,  30-45,  and  45-60 
minutes. 
Appendix A

Computing  Effect  Sizes  for  Repeated  Measures 

Meta-analysis  provides  estimates  of  the  average  ES,  and  the  variation  of  individual  ES  estimates 
about  that  average.  The  homogeneity  tests  for  variation  about  the  average  are  especially  important 
in  the  present  context.  If  the  ESs  for  different  training  programs  display  greater-than-chance 
variation,  it  is  reasonable  to  search  for  moderator  variables  that  can  explain  the  observed 
heterogeneity.  In  the  present  review,  program  design  facets  and  demographic  variables  were  of 
interest  as  potential  moderator  variables. 

Studies  must  be  assigned  appropriate  weights  to  compute  the  average  ES  and  test  for  variation 
about  the  average.  The  weights  are  based  on  the  precision  of  the  individual  ES  estimates.  All 
studies  reviewed  here  employed  pretest-posttest  research  designs.  In  such  cases,  the  correlation  of 
pretest  scores  with  posttest  scores  affects  the  sampling  variance  that  is  the  index  of  precision  for 
the  ES  estimate.  Therefore,  the  pretest-posttest  correlation  must  be  known  to  derive  sampling 
variance  estimates  that  are  suitable  for  determining  ES  weights.  The  correlation  must  be  known 
whether  the  analyses  employ  standardized  mean  change  scores  or  difference  scores  (Morris,  2000). 
For  change  scores,  the  proper  estimate  of  sample  variance  is: 

$$\sigma_{Diff}^{2} = \sigma_{1}^{2} + \sigma_{2}^{2} - 2 r \sigma_{1} \sigma_{2} \tag{A1}$$

In this equation, the subscripted “Diff” indicates that the variable of interest is a difference score.
The pretest-posttest correlation, r, is expected to be positive and moderate to large. As a
consequence, the last term of Equation A1 will be moderate to large relative to the first two terms.
It follows that simply pooling the pretest and posttest variances, as would be the case if the
pretest-posttest correlation was ignored, will result in overestimation of the true sampling variance. If the
variance is overestimated, the z-scores associated with the deviation of specific ES values from the
average ES will be smaller than they would be if the correct variance were used. The overall test
for homogeneity of ESs, Cochran’s Q, is the sum of the squared z-scores. Thus, overestimating
sampling variance will lead to underestimating Q. This bias in the Q-test values could lead
erroneously to the conclusion that a given moderator is unimportant. The tests for moderators were
central to this review, so accurate variance estimates were essential.
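A small numerical illustration of the overestimation argument follows. The standard deviations and the correlation of .80 are hypothetical values chosen for the example, not figures from the reviewed studies, and `var_diff` is an illustrative name for Equation A1.

```python
def var_diff(sd_pre: float, sd_post: float, r: float) -> float:
    """Equation A1: variance of pretest-posttest difference scores."""
    return sd_pre ** 2 + sd_post ** 2 - 2.0 * r * sd_pre * sd_post

# Hypothetical sample standard deviations and correlation.
sd_pre, sd_post = 5.0, 5.0
naive = var_diff(sd_pre, sd_post, 0.0)    # correlation ignored
correct = var_diff(sd_pre, sd_post, 0.8)  # correlation of .80 used

# Factor by which the variance is overstated when r is ignored.
inflation = naive / correct
print(f"naive = {naive:.1f}, correct = {correct:.1f}, inflation = {inflation:.1f}x")
```

In this example the variance is overstated fivefold when the correlation is ignored. Any squared z-score that uses the inflated variance as its denominator shrinks by the same factor, which is the mechanism by which Q would be underestimated.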

The correct variance estimates could be estimated easily if studies routinely reported the
pretraining/posttraining correlations for test scores. Unfortunately, this information is seldom
reported. The required information could be extracted from the t-test or F-test if either statistic was
reported separately for each condition in the study. Once again, aerobic training studies seldom
provide this information.

After  developing  pretest-posttest  correlation  estimates,  the  analysis  followed  guidelines  provided 
by  Morris  and  DeShon  (2002).  First,  the  variance  for  individual  observations  was  computed  by 
applying Equation A1 above. Second, the standard deviation of the differences (SDdiff) was
computed  by  taking  the  square  root  of  the  variance.  This  standard  deviation  was  used  to  compute 
the  initial  ESrm  (Equation  A2).  A  separate  ES  was  computed  for  each  record  in  the  data  file.  A 
record  consisted  of  the  results  for  a  single  aerobic  training  program  administered  to  a  particular 
sample  of  subjects. 
The use of an average pretest-posttest correlation will obviously be inaccurate in many cases.
However,  these  correlations  clearly  have  been  positive  and  substantial  when  estimates  have  been 
available.  Ignoring  this  strong  trend  would  lead  to  very  conservative  tests  for  moderator  effects. 
The  uncertainty  introduced  by  the  use  of  average  values  was  preferable  to  having  results  that 
certainly  were  too  conservative. 

The estimated pretest-posttest correlation values were combined with the sample standard
deviations to compute the variance of the difference scores as shown in Equation A1. SDdiff was
the square root of this variance. The ES for repeated measures was

$$ES_{RM} = \frac{Mean_{post} - Mean_{pre}}{SD_{diff}} \tag{A2}$$

Equation A2 depends on the estimated variability associated with the change scores (i.e., the
SDdiff). The ESrm is appropriate in cases in which one is interested in the average improvement
due to a training program (measured in standard deviation units) above zero (Morris & DeShon,
2002). However, if one is interested in the average performance due to training relative to the
average performance without training, then the variability associated with the separate groups is
used as the basis for the effect size calculation. The formula for converting an ES for repeated
measures to an ES for independent groups was

$$ES_{IG} = ES_{RM} \sqrt{2(1 - \rho)} \tag{A3}$$
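The chain from summary statistics to both effect sizes can be sketched as below. All input values are hypothetical, and the function names are illustrative. The example uses equal pretest and posttest standard deviations, in which case the converted value reduces to the mean change divided by the common standard deviation.

```python
import math

def es_rm(mean_pre: float, mean_post: float,
          sd_pre: float, sd_post: float, r: float) -> float:
    """Repeated-measures effect size from summary statistics (Equations A1 and A2)."""
    sd_diff = math.sqrt(sd_pre ** 2 + sd_post ** 2 - 2.0 * r * sd_pre * sd_post)
    return (mean_post - mean_pre) / sd_diff

def es_ig_from_rm(es: float, rho: float) -> float:
    """Independent-groups effect size from the repeated-measures one (Equation A3)."""
    return es * math.sqrt(2.0 * (1.0 - rho))

# Hypothetical study: mean rises from 40 to 44, both SDs are 5, r = .80.
rm = es_rm(mean_pre=40.0, mean_post=44.0, sd_pre=5.0, sd_post=5.0, r=0.8)
ig = es_ig_from_rm(rm, rho=0.8)
print(f"ESrm = {rm:.3f}, ESig = {ig:.3f}")
```

The repeated-measures value is larger than the independent-groups value whenever the correlation exceeds .50, because the standard deviation of the change scores is then smaller than the standard deviation within either measurement occasion.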


Weighting  ESrm  Estimates 

Individual ES estimates must be weighted to obtain the most precise aggregated ESrm estimate
and to test for heterogeneity in the individual estimates. The appropriate weights are the inverse of
the variance. The variance of an individual ESrm estimate can be computed by applying the
single-group pretest-posttest change score variance formula in Table 2 of Morris and DeShon
(2002, p. 117).


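\[ \text{Variance} = \left(\frac{1}{n}\right)\left(\frac{n-1}{n-3}\right)\left(1 + n\,\delta_{RM}^{2}\right) - \frac{\delta_{RM}^{2}}{[c(n-1)]^{2}} \tag{A4} \]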


In this equation, n is sample size and δRM is the population value for ESrm. The equation includes
a bias correction, c, to obtain accurate variance estimates. This correction factor was obtained by
applying the approximation developed by Hedges (1982), and given as Equation 23 in Morris and
DeShon (2002, p. 117):


\[ c(df) = 1 - \frac{3}{4\,df - 1} \tag{A5} \]


The variance computations required one additional input, δRM. Ideally, this parameter would be set
equal to the unknown population ES. An estimate of this population parameter, dRM, was used
because the population value can be estimated only once the variance is already known.
Given  this  circularity,  the  recommended  solution  is  to  compute  the  unweighted  ES  and  use  that 
value  for  computing  the  variance  for  ESrm  (Hedges,  1982;  Morris  &  DeShon,  2002).  The  present 
analyses  employed  this  approach. 

The variance of ESRM was computed by applying Equation A4 after estimating δRM. The
derivation of that equation can be found in the appendix to Morris and DeShon (2002), or in
Gibbons, Hedeker, and Davis (1993).
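
The weighting procedure can be sketched as follows. This is an illustration under stated assumptions rather than the code used for the review: it takes Equation A4 to be the single-group change score variance from Morris and DeShon (2002), evaluates the bias correction of Equation A5 at df = n - 1, and substitutes the unweighted mean ES for δRM. The function names and the three example records are hypothetical.

```python
def bias_correction(df):
    """Equation A5: approximate small-sample bias correction c(df)."""
    return 1.0 - 3.0 / (4.0 * df - 1.0)


def es_rm_variance(n, delta_rm):
    """Equation A4: sampling variance of a single-group ESrm (requires n > 3)."""
    c = bias_correction(n - 1)  # assumption: df = n - 1 for a single group
    first_term = (1.0 / n) * ((n - 1.0) / (n - 3.0)) * (1.0 + n * delta_rm ** 2)
    return first_term - delta_rm ** 2 / c ** 2


def weighted_mean_es(records):
    """records: list of (n, es_rm) pairs. Returns (weighted mean ES, weights)."""
    # The unweighted mean ES stands in for the unknown population value.
    delta_hat = sum(es for _, es in records) / len(records)
    weights = [1.0 / es_rm_variance(n, delta_hat) for n, _ in records]
    total = sum(w * es for w, (_, es) in zip(weights, records))
    return total / sum(weights), weights


# Hypothetical records: (sample size, ESrm)
records = [(10, 0.8), (20, 1.2), (30, 1.0)]
mean_es, weights = weighted_mean_es(records)
print(round(mean_es, 3), [round(w, 2) for w in weights])
```

Because every record shares the same stand-in value for δRM, the weights in this sketch differ only through sample size, so larger samples receive more weight.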

Appendix B
Tucker-Lewis  Index 


The TLI (Tucker & Lewis, 1973) was introduced to guard against what may be a widespread
problem  in  meta-analysis.  Moderator  analyses  begin  with  a  significance  test.  If  the  null  hypothesis 
is  rejected,  the  moderator  variable  is  accepted  as  a  meaningful  influence  on  ES  even  if  the 
differences  between  groups  are  quite  small. 

Relying on significance tests to identify important results is a risky proposition in any statistical
analysis. Statistical significance is the product of sample size and ES (Rosenthal & Rosnow, 1984).
In meta-regression, ES might be labeled "meta-ES" because it reflects the differences in the
primary ES across moderator groups. A significant meta-ES could indicate a substantial
between-groups difference, but it does not rule out the possibility that a small between-groups
difference has been amplified by a large sample size. It follows that the meta-ES must be separated
from sample size to interpret findings properly, yet this principle is not routinely applied in
meta-analysis.

The TLI was adapted to provide a meta-ES metric. The TLI, which is the proportion of
greater-than-chance variation in ES that is accounted for by the moderator model, can be computed
from the χ² values from a moderator analysis. The variation in ESrm determines the χ² values. The
TLI equation is


\[ TLI = \frac{(\chi^{2}_{Null} / df_{Null}) - (\chi^{2}_{Model} / df_{Model})}{(\chi^{2}_{Null} / df_{Null}) - 1} \tag{B1} \]


The expected value of the χ²/df ratio is 1, so the denominator of Equation B1 is the amount of
observed variation in ESrm that is greater than expected by chance. The numerator is the variation
in ESrm accounted for by the model, i.e., total ESrm variation minus the residual ESrm variation
after fitting the model. The TLI is a reasonable index of the meta-regression ES, and maintains a
connection between effect size and the probability that a moderator will be statistically significant.


The TLI is not an exact parallel to the usual effect size indicators such as the proportion of
variance explained in an ANOVA. One reason is that the TLI is analogous to Hays' (1963) ω²
rather than the usual η². The difference between the two is that the variance that would be expected
by chance is subtracted from the variance explained when computing ω², but not when computing
η². This difference is the reason that the TLI will be less than zero when
χ²Model / dfModel > χ²Null / dfNull, because the numerator will be a negative number. This
situation arises when the reduction in the χ² produced by a model is small relative to the number
of parameters in the model. For this reason, the reported TLI is the value derived from
Equation B1 or .00, whichever is larger.
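
A short sketch of the computation in Equation B1, including the floor at .00, is given below. The function name and the χ² and df values are hypothetical. The guard against a null-model ratio of 1 or less is an added assumption, since the index is undefined or uninterpretable when there is no greater-than-chance variation to explain.

```python
def tli(chi2_null, df_null, chi2_model, df_model):
    """Tucker-Lewis Index from Equation B1, floored at 0."""
    null_ratio = chi2_null / df_null
    model_ratio = chi2_model / df_model
    if null_ratio <= 1.0:
        # Assumption: no excess variation in ES, so the index is not computed.
        raise ValueError("chi-square/df for the null model must exceed 1")
    value = (null_ratio - model_ratio) / (null_ratio - 1.0)
    return max(value, 0.0)  # report .00 when the raw value is negative


# Hypothetical moderator analysis: the model uses 4 df and reduces
# chi-square from 300 to 240.
print(tli(300.0, 100, 240.0, 96))  # (3.0 - 2.5) / (3.0 - 1.0) = 0.25

# A model that uses 10 df for a reduction of only 5 yields a negative
# raw value, which is reported as 0.0.
print(tli(300.0, 100, 295.0, 90))
```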


The  interpretation  of  the  TLI  employed  Cohen’s  (1988)  general  criteria  for  ES  evaluations. 
Cohen’s  criteria  classify  ESs  on  the  basis  of  the  proportion  of  observed  variation  explained  by  a 
predictor.  In  this  case,  TLI  is  the  proportion  of  non-random  variation  in  ESrm,  so  Cohen’s  (1988) 
ES  classification  rule  is  a  suitable  index  for  characterizing  the  strength  of  association  of  moderator 
variables  with  ESrm:  small  meta-ES,  .01  <  TLI  <  .10;  moderate  meta-ES,  .10  <  TLI  <  .25;  large 
meta-ES,  TLI  >  .25. 
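
The classification rule can be expressed as a small lookup. The sketch below is illustrative only. The cutoffs come from the text, but the assignment of exact boundary values (.10 and .25) to the higher category and the label "negligible" for values below .01 are assumptions, as is the function name.

```python
def classify_meta_es(tli_value):
    """Label a TLI value using the cutoffs of .01, .10, and .25."""
    if tli_value >= 0.25:
        return "large"
    if tli_value >= 0.10:
        return "moderate"
    if tli_value >= 0.01:
        return "small"
    return "negligible"  # hypothetical label for TLI below .01


for value in (0.00, 0.05, 0.18, 0.40):
    print(value, classify_meta_es(value))
```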

Appendix C

Coding  schemes  for  intensity  and  initial  fitness  level 

The classification schemes for levels of intensity (for distinct physiological and behavioral
measurements), and initial fitness level are presented below. For intensity, the columns represent
percentage of VO2 reserve (%VO2R) and percentage of heart rate reserve (%HRR); percentage of
maximum heart rate (%HRmax); percentage of VO2max (%VO2max); and ratings of perceived
exertion (RPE). This classification scheme is based on Heyward (2006).

The initial fitness codes are based on pretraining VO2max values, and vary depending on age and
gender  (Source:  http://preventdisease.com/news/articles/vo2_max_how_fit_athlete.shtml). 


Table C1

Intensity coding scheme

Intensity     %VO2R or %HRR   %HRmax   %VO2max   RPE
Very light    <20             <35      <1        <10
Light         20-39           35-54    2-27      10-11
Moderate      40-59           55-69    28-50     12-13
Hard          60-84           70-89    51-81     14-16
Very hard     85+             90+      82+       17-19
Maximal       100             100      98        20
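
As an illustration of how a coding scheme of this kind can be applied, the sketch below assigns an intensity level from a %HRmax value using the %HRmax column of the intensity table. It is not the coding procedure used in the review. The function name is hypothetical, and treating each lower bound as inclusive (so that a non-integer value such as 54.5 is coded as light) is an assumption, because the table lists integer ranges only.

```python
from bisect import bisect_right

# Lower bounds (%HRmax) at which each successive intensity level begins.
LOWER_BOUNDS = [35, 55, 70, 90, 100]
LABELS = ["very light", "light", "moderate", "hard", "very hard", "maximal"]


def code_intensity_hrmax(pct_hrmax):
    """Return the intensity label for a training intensity given as %HRmax."""
    if not 0 <= pct_hrmax <= 100:
        raise ValueError("%HRmax must be between 0 and 100")
    # bisect_right counts how many lower bounds are <= the value.
    return LABELS[bisect_right(LOWER_BOUNDS, pct_hrmax)]


for pct in (30, 60, 75, 90, 100):
    print(pct, code_intensity_hrmax(pct))
```

The same lookup structure would apply to the other columns by substituting their lower bounds.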

Table C2

Initial Fitness Level coding scheme

Age        Very poor       Poor            Fair            Average
(years)    Men    Women    Men    Women    Men    Women    Men    Women
20-24      <32    <27      32-37  27-31    38-43  32-36    44-50  37-41
25-29      <31    <26      31-35  26-30    36-42  31-35    43-48  36-40
30-34      <29    <25      29-34  25-29    35-40  30-33    41-45  34-37
35-39      <28    <24      28-32  24-27    33-38  28-31    39-43  32-35
40-44      <26    <22      26-31  22-25    32-35  26-29    36-41  30-33
45-49      <25    <21      25-29  21-23    30-34  24-27    35-39  28-31
50-54      <24    <19      24-27  19-22    28-32  23-25    33-36  26-29

Note. Ranges are pretraining VO2max values.

Table C2 (continued)

Initial Fitness Level coding scheme

Age        Good            Very good       Excellent
(years)    Men    Women    Men    Women    Men    Women
20-24      51-56  42-46    57-62  47-51    >62    >51
25-29      49-53  41-44    54-59  45-49    >59    >49
30-34      46-51  38-42    52-56  43-46    >56    >46
35-39      44-48  36-40    49-54  41-44    >54    >44
40-44      42-46  34-37    47-51  38-41    >51    >41
45-49      40-43  32-35    44-48  36-38    >48    >38
50-54      37-41  30-32    42-46  33-36    >46    >36